Abstract. We present a definition of the distance between probability
distributions. Our definition is based on the $L_1$ norm on the space of probability
measures. We compare our distance with the well-known Kullback-Leibler
divergence and with the proper distance defined using the Fisher matrix as a
metric on the parameter space. We consider using our notion of distance in
several problems in gravitational wave data analysis: to place templates in the
parameter space in searches for gravitational-wave signals, to assess the quality of
search templates, and to study the signal resolution.

1. Introduction

A number of long baseline interferometric gravitational wave detectors are working 
around the world: in the USA (LIGO project), in Italy (VIRGO project, a 
joint French-Italian collaboration), in Germany (GEO600 project), and in Japan 
(TAMA300 project). A network of resonant bar detectors continues its operation: 
in the USA (ALLEGRO detector) and in Italy (detectors AURIGA, NAUTILUS and 
EXPLORER (located in CERN near Geneva)). These detectors are collecting a large 
amount of data that are being currently analyzed. There is a proposed space borne 
detector LISA to be launched by NASA and ESA in the next decade. The quest 
for gravitational waves in the data requires optimal statistical methods and efficient 
numerical algorithms to search over very large parameter spaces [U H] . A standard 
method is the maximum likelihood detection method, which consists of searching for 
local maxima of the likelihood function with respect to the parameters [7] . Assuming 
Gaussian noise in the detector, the maximum likelihood method consists of correlating 
the data with templates defined over the parameter space. In this paper we introduce 
a new tool for data analysis - a distance between the probability density functions. 
This distance can be used to define the covering radius to design an optimal (with 
smallest number of nodes) grid in the parameter space. One can use the distance to 
determine the quality of the search templates - simplified models of the signal over 
a reduced parameter space. The distance can also be used to study the problem of 
signal resolution - a problem that occurs in the estimation of parameters of white dwarf 
binary systems in the data of planned space-borne detector LISA. This distance would 
play the same role in gravitational wave data analysis as the line element defined by the 
Fisher matrix interpreted as a Riemannian metric. However, the Fisher matrix is obtained
from a Taylor expansion of the Kullback-Leibler divergence up to the second-order term
and is therefore only an approximation. Moreover, the distance we introduce fulfills
the triangle inequality, which the Kullback-Leibler divergence, from which the
Fisher matrix is obtained, does not. Finally, our distance can be defined for any probability
densities, not only smooth ones and not only those absolutely continuous with respect to each
other.

In Section 2 we shall briefly review the problem of signal detection and parameter 
estimation. In Section 3 we shall motivate and introduce the $L_1$-norm distance between
two probability density functions. We show that as a consequence of the triangle
inequality, our distance is an appropriate tool in several data analysis problems. In
Section 4 we shall review the Kullback-Leibler divergence. In Section 5 we shall
calculate the $L_1$-norm and the Kullback-Leibler divergence in several cases important
for applications, namely for the case of Gaussian probability density functions. In
Section 6 we shall discuss applications of the $L_1$-norm to the problem of signal
resolution, template placement and search template design for the simple case of a
monochromatic signal. Section 7 concludes our paper.

2. Problem of signal detection and parameter estimation 

2.1. Signal detection and parameter estimation in Gaussian noise 

Suppose that we want to detect a known signal s embedded in noise n. The signal 
detection problem can be posed as a hypothesis testing problem, where the null 
hypothesis is that the signal is absent, and the alternative hypothesis is that the
signal is present. A solution to this problem has been found by Neyman and Pearson
[5]. They have shown that, subject to a given false alarm probability, the test that
maximizes the detection probability is the likelihood ratio test. Assuming that the
noise is additive, the data time series x(t) can be written as

x(t) =n(t) + s(t). (1) 

In addition if the noise is a zero-mean, stationary, and Gaussian random process, the 
log likelihood function is given by 

$\log\Lambda = (x|s) - \frac{1}{2}(s|s)$, (2)

where the scalar product $(\,\cdot\,|\,\cdot\,)$ is defined by

$(x|y) := 4\,\Re \int_0^\infty \frac{\tilde{x}(f)\,\tilde{y}^*(f)}{S(f)}\,df$. (3)

In Eq. (3) $\Re$ denotes the real part of a complex expression, the tilde denotes the Fourier
transform, the asterisk is complex conjugation, and $S$ is the one-sided spectral density
of the noise in the detector. Equation (2) is called the Cameron-Martin formula [7].
From the Cameron-Martin formula we immediately see that in the Gaussian case
the likelihood ratio test consists of correlating the data $x$ with the signal $s$ that is
present in the noise and comparing the correlation to a threshold. Such a correlation
$G = (x|s)$ is called the matched filter. The matched filter is a linear operation on the
data.

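An important quantity is the optimal signal-to-noise ratio $\rho$ defined by

$\rho := \sqrt{(s|s)}$. (4)

Since the data $x$ are Gaussian and $G$ is linear in $x$, $G$ has a normal probability
density function. The probability density functions $p_0$ and $p_1$ of the correlation $G$ when
the signal is respectively absent and present are given by

$p_0(G) = \frac{1}{\sqrt{2\pi\rho^2}} \exp\left[-\frac{G^2}{2\rho^2}\right]$, (5)

$p_1(G) = \frac{1}{\sqrt{2\pi\rho^2}} \exp\left[-\frac{(G - \rho^2)^2}{2\rho^2}\right]$. (6)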

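The probabilities of false alarm $Q_F$ and of detection $Q_D$ are readily expressed in terms of
error functions:

$Q_F = \frac{1}{2}\left[1 - \mathrm{erf}\left(\frac{1}{\sqrt{2}}\frac{G_0}{\rho}\right)\right]$, (7)

$Q_D = \frac{1}{2}\left[1 - \mathrm{erf}\left(\frac{1}{\sqrt{2}}\left(\frac{G_0}{\rho} - \rho\right)\right)\right]$, (8)
where $G_0$ is the threshold and the error function erf is defined as

$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt$. (9)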

Thus to detect the signal we proceed as follows. We choose a certain value of the
false alarm probability. From Eq. (7) we calculate the threshold $G_0$. We evaluate
the correlation $G$. If $G$ is larger than the threshold $G_0$ we say that the signal is
present. We see that in the Gaussian case a single parameter, the signal-to-noise ratio
$\rho$, determines both the false alarm and the detection probabilities, and consequently the
receiver's operating characteristic. For a given false alarm probability, the greater the
signal-to-noise ratio, the greater the probability of detection of the signal.
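
The detection procedure just described can be sketched numerically. The following is a minimal illustration, not the authors' code: the function names and the chosen false alarm probability are assumptions made here, and the threshold is obtained by inverting Eq. (7) through the standard normal quantile function, using the identity $Q_F = 1 - \Phi(G_0/\rho)$.

```python
from math import erf, sqrt
from statistics import NormalDist


def false_alarm_probability(threshold, rho):
    """Eq. (7): Q_F = 1/2 [1 - erf(G_0 / (sqrt(2) rho))]."""
    return 0.5 * (1.0 - erf(threshold / (sqrt(2.0) * rho)))


def detection_probability(threshold, rho):
    """Eq. (8): Q_D = 1/2 [1 - erf((G_0 / rho - rho) / sqrt(2))]."""
    return 0.5 * (1.0 - erf((threshold / rho - rho) / sqrt(2.0)))


def threshold_for_false_alarm(q_f, rho):
    """Invert Eq. (7): Q_F = 1 - Phi(G_0 / rho), hence G_0 = rho * Phi^{-1}(1 - Q_F)."""
    return rho * NormalDist().inv_cdf(1.0 - q_f)


if __name__ == "__main__":
    q_f = 0.01  # illustrative false alarm probability (assumed value)
    for rho in (2.0, 4.0, 6.0):
        g0 = threshold_for_false_alarm(q_f, rho)
        print(rho, g0, detection_probability(g0, rho))
```

For a fixed false alarm probability the printed detection probability grows with the signal-to-noise ratio, in line with the statement above.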

In general, we know the signal as a function of several unknown parameters $\theta$.
Thus to detect the signal we also need to estimate its parameters. A convenient
method is the maximum likelihood method, in which the estimators are those values of
the parameters that maximize the likelihood ratio. Thus the maximum likelihood
estimators $\hat{\theta}$ of the parameters $\theta$ are obtained by solving the set of equations

$\frac{\partial \log\Lambda}{\partial\theta_i} = 0$, (10)

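where $\theta_i$ is the $i$th parameter. The quality of any parameter estimation method can
be assessed using the Fisher information matrix $\Gamma$ and the Cramer-Rao bound [9].
The components of this matrix are defined by

$\Gamma_{ij} := E\left[\frac{\partial \log\Lambda}{\partial\theta_i}\,\frac{\partial \log\Lambda}{\partial\theta_j}\right] = -E\left[\frac{\partial^2 \log\Lambda}{\partial\theta_i\,\partial\theta_j}\right]$. (11)

The Cramer-Rao bound states that for unbiased estimators the covariance matrix $C$ of
the estimators satisfies $C \geq \Gamma^{-1}$. (The inequality $A \geq B$ for matrices means that the matrix
$A - B$ is nonnegative definite.) In the case of Gaussian noise, the formula for the
Fisher matrix takes the form

$\Gamma_{ij} = \left(\frac{\partial s}{\partial\theta_i}\,\Big|\,\frac{\partial s}{\partial\theta_j}\right)$, (12)

where the scalar product $(\,\cdot\,|\,\cdot\,)$ is given by Eq. (3).

2.2. The case of a monochromatic signal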

Let us consider an application of the maximum likelihood estimation method to the
case of a simple signal, a monochromatic signal. The monochromatic signal depends
on three parameters: amplitude $A_0$, phase $\phi_0$, and angular frequency $\omega_0$, and it has
the form

$s = A_0 \cos(\omega_0 t - \phi_0)$. (13)
Let us rewrite the signal (13) as

$s = A_c \cos(\omega_0 t) + A_s \sin(\omega_0 t)$, (14)

where

$A_c = A_0 \cos\phi_0$, (15)
$A_s = A_0 \sin\phi_0$. (16)

Using Parseval's theorem and assuming that the observation time $T$ is much longer
than the period $2\pi/\omega_0$, we have

$(\cos(\omega_0 t)|\cos(\omega_0 t)) \simeq (\sin(\omega_0 t)|\sin(\omega_0 t)) \simeq \frac{T}{S_0}$, (17)
$(\cos(\omega_0 t)|\sin(\omega_0 t)) \simeq 0$, (18)

where $S_0$ is the one-sided spectral density of noise of the detector at frequency $\omega_0$.
Thus the log likelihood ratio is approximately given by

$\log\Lambda = 2\frac{T}{S_0}\left[A_c \langle x\cos(\omega_0 t)\rangle + A_s \langle x\sin(\omega_0 t)\rangle - \frac{1}{2}(A_c^2 + A_s^2)\right]$, (19)
where the operator $\langle\,\cdot\,\rangle$ is defined as

$\langle g(t)\rangle := \frac{1}{T}\int_0^T g(t)\,dt = \int_0^1 g(zT)\,dz$, (20)

where the last equation follows by introducing a dimensionless time variable $z = t/T$.
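The maximum likelihood estimators $\hat{A}_c$ and $\hat{A}_s$ of the amplitudes $A_c$ and $A_s$ can be
obtained in a closed analytic form by solving the set of the following two linear
equations:

$\langle x\cos(\omega_0 t)\rangle - A_c = 0$, (21)

$\langle x\sin(\omega_0 t)\rangle - A_s = 0$. (22)

We have

$\hat{A}_c = \langle x\cos(\omega_0 t)\rangle$, (23)
$\hat{A}_s = \langle x\sin(\omega_0 t)\rangle$.

By substituting the maximum likelihood estimators of amplitudes into the log
likelihood ratio we get

$\log\Lambda_r = \frac{T}{S_0}\left[\langle x\cos(\omega_0 t)\rangle^2 + \langle x\sin(\omega_0 t)\rangle^2\right]$. (24)

We shall denote the reduced likelihood ratio $\log\Lambda_r$ by $\mathcal{F}$ and we shall call it the
$\mathcal{F}$-statistic. The maximum likelihood estimators $\hat{\phi}_0$ and $\hat{A}_0$ of the phase and amplitude
are given by

$\hat{\phi}_0 = \mathrm{atan}\left(\frac{\langle x\sin(\omega_0 t)\rangle}{\langle x\cos(\omega_0 t)\rangle}\right)$, (25)

$\hat{A}_0 = \sqrt{\langle x\cos(\omega_0 t)\rangle^2 + \langle x\sin(\omega_0 t)\rangle^2}$. (26)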

Thus to find the maximum likelihood estimators of the parameters of the monochromatic
signal, we first find the maximum of the $\mathcal{F}$-statistic with respect to angular
frequency; the angular frequency $\hat{\omega}$ corresponding to the maximum of $\mathcal{F}$ is the
maximum likelihood estimator of $\omega_0$. Then we use Eqs. (25) and (26) with $\omega_0 = \hat{\omega}$ to find
the maximum likelihood estimators of phase and amplitude. The maximum likelihood
detection method consists of correlating the data $x$ with two filters $F_c = \cos(\omega_0 t)$ and
$F_s = \sin(\omega_0 t)$. We easily see that the $\mathcal{F}$-statistic is invariant with respect to the
following transformation of the filters

$\sin(\omega_0 t) \rightarrow A_F \sin(\omega_0 t + \phi_F)$, (27)

$\cos(\omega_0 t) \rightarrow A_F \cos(\omega_0 t + \phi_F)$, (28)

where $A_F$ and $\phi_F$ are arbitrary constants.
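
The two-step search described above (maximize the statistic over frequency, then read off the phase) can be illustrated with a small sketch. It is only a toy under stated assumptions: the data are noise-free, the time average is replaced by a midpoint sum, the constant factor $T/S_0$ is dropped because it does not move the maximum, and all names and numerical values are hypothetical. The amplitude estimate is left out of the sketch.

```python
from math import atan2, cos, pi, sin


def time_average(values):
    """Discrete stand-in for the operator <g> = (1/T) * integral_0^T g(t) dt."""
    return sum(values) / len(values)


def quadrature_sum(x, t, omega):
    """Return <x cos>^2 + <x sin>^2 (proportional to Eq. (24)) and both averages."""
    c = time_average([xi * cos(omega * ti) for xi, ti in zip(x, t)])
    s = time_average([xi * sin(omega * ti) for xi, ti in zip(x, t)])
    return c * c + s * s, c, s


def estimate_frequency_and_phase(x, t, omegas):
    """Grid search over trial angular frequencies, then the phase from Eq. (25)."""
    best = max(omegas, key=lambda w: quadrature_sum(x, t, w)[0])
    _, c, s = quadrature_sum(x, t, best)
    return best, atan2(s, c)


# Noise-free toy data; every number below is an assumed, illustrative value.
T, n = 32.0 * pi, 2000
t = [T * (k + 0.5) / n for k in range(n)]          # midpoint sampling of [0, T]
omega0, phi0, a0 = 2.0, 0.7, 1.5
x = [a0 * cos(omega0 * ti - phi0) for ti in t]
omegas = [1.9 + 0.001 * k for k in range(201)]     # trial frequencies 1.9 ... 2.1

omega_hat, phi_hat = estimate_frequency_and_phase(x, t, omegas)
print(omega_hat, phi_hat)
```

Using `atan2` instead of a plain arctangent keeps the phase estimate in the correct quadrant; with noisy data the same search applies, but the estimates scatter around the true values.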

With the above approximations, the signal-to-noise ratio $\rho$ and the Fisher matrix
$\Gamma$ for the signal (13) are given by

$\rho^2 = A_0^2 \frac{T}{S_0}$, (29)

$\Gamma_{ij} = \frac{T}{S_0} \begin{pmatrix} 1 & 0 & 0 \\ 0 & A_0^2 & -\frac{1}{2} A_0^2 T \\ 0 & -\frac{1}{2} A_0^2 T & \frac{1}{3} A_0^2 T^2 \end{pmatrix}$, (30)

where $i, j = (A_0, \phi_0, \omega_0)$.

3. $L_1$-norm distance

3.1. Motivation 

Let $I$ denote a probabilistic space of events $i \in I$; for the sake of clarity we will
consider $I$ to be a set of finitely many elements in this initial discussion (generalization
to continuous probabilistic spaces and measures will be commented upon later in this
section). Let $p(i)$, $q(i)$ be two different probability distributions (strictly speaking:
probabilistic measures) defined on $I$.

We propose to define the distance $d_L$ between two probability distributions $p$ and
$q$ as

$d_L(p, q) := \frac{1}{2} \sum_{i \in I} |p(i) - q(i)|$,

where the sum is taken over the whole event space $I$.

The space of probability distributions with the above distance is a metric space
because the distance fulfills all the axioms of a metric, i.e. $d_L(p, q) = 0$ if and only if
$p = q$, $d_L(p, q) = d_L(q, p)$, and the triangle inequality holds: $d_L(p, q) + d_L(q, r) \geq d_L(p, r)$.

Remark: $d_L(p, q) \leq 1$ for any $p$, $q$; the maximal value is achieved if and only if $p$
and $q$ have disjoint supports, i.e. when for every $i \in I$ either $p(i) = 0$ or $q(i) = 0$.

We motivate the above definition by the following example. 

Consider the following situation: we are performing an observation or 
measurement on a system where either of two random processes may be operating 
at a given time. To our best knowledge, each of the two processes is described by a 
determined probability distribution of the measured results; moreover, we conjecture 
(or know) the probabilities that each of these processes is operating in a given instance 
of the measurement or observation - alternatively, if such knowledge is lacking, we 
assume that each of the two is equally likely to be in effect. In every instance of 
the experiment, we need to decide, based on the obtained result, which of the two 
processes was more likely to be operating, and we need to somehow estimate our 
overall confidence in these decisions, based on how much the two pre-determined 
probability distributions differ over the space of observed results. 

To be more precise, let $I$ be the space of possible outcomes of our measurements,
and let $p(i)$, $q(i)$ ($i \in I$) be the probabilities that the result $i$ is obtained when the
process $\mathcal{P}$ (respectively, $\mathcal{Q}$) is in operation. Recall that we are regarding $I$ to be a finite
set, as is in fact the case in many realistic experimental setups (even if the number of
elements of $I$ is typically quite large).

We consider the special case when both processes are equally likely, a natural
assumption when we lack any a priori knowledge about the probability of either
process being in operation for a given event.

Clearly, the total probability that a measurement will produce the outcome $i \in I$
is given by the combination

$\frac{1}{2}\left(p(i) + q(i)\right)$,

while the conditional probability that, assuming the outcome $i_0 \in I$ was obtained,
it resulted from the process $\mathcal{P}$, is

$\frac{p(i_0)}{p(i_0) + q(i_0)}$, (31)

and likewise for process $\mathcal{Q}$:

$\frac{q(i_0)}{p(i_0) + q(i_0)}$.

We see that the relative likelihood of each process being in operation, given a
specific outcome, leads to basing our decision on the partition of the space $I$ into two
subsets

$P = \{i \in I : p(i) > q(i)\}$

and

$Q = \{i \in I : q(i) > p(i)\}$,

where if the outcome is $i_0 \in P$ it is more likely to result from process $\mathcal{P}$, and
correspondingly for $i_0 \in Q$ and $\mathcal{Q}$.

Note that we have neglected here the subset of $I$ where $p(i) = q(i)$, i.e., where
both processes were equally likely to be operating. For such outcomes the decision is
arbitrary and we will see below that it has no impact upon our conclusions.

Now, in every case (for any value of $i_0 \in I$) the greater likelihood rule stated
above might lead us to err; the probability that a decision in favor of the more likely
of either $\mathcal{P}$ or $\mathcal{Q}$ is mistaken, conditional upon the outcome being $i_0$, is given by

$PE(i_0) = \frac{q(i_0)}{p(i_0) + q(i_0)}$ ($i_0 \in P$)

and

$PE(i_0) = \frac{p(i_0)}{p(i_0) + q(i_0)}$ ($i_0 \in Q$).

Under our assumptions, the overall probability that the greater likelihood rule will fail
is therefore obtained by summing the above, weighted by the probability of the result
being $i_0$, over all of $I$:

$PE = \frac{1}{2}\sum_{i \in P} q(i) + \frac{1}{2}\sum_{i \in Q} p(i)$.

For simplicity we ignore here the fact that this might need to be corrected by
adding 1/2 times the probability of hitting the subset of $I$ where $p(i) = q(i)$ (whatever
the arbitrary rule applied to decide for such outcomes, there is a probability of 1/2
that it is wrong).

Note that the overall probability of error $PE$ given above may be, in one extreme
case, equal to zero (when the probability distributions $p(i)$ and $q(i)$ have disjoint
supports) or, in the other extreme, equal to 1/2 (when both distributions
are identical).

The above formula may be simplified to

$PE = \frac{1}{2}\sum_{i \in I} \min(p(i), q(i))$.

Note that this form already takes correctly into account the possibility that
$p(i) = q(i)$ for some values of $i$.

Further simplification follows by using

$\min(p(i), q(i)) = \frac{1}{2}\left(p(i) + q(i) - |p(i) - q(i)|\right)$,

which leads to

$PE = \frac{1}{4}\sum_{i \in I}\left(p(i) + q(i) - |p(i) - q(i)|\right)$

and, ultimately, to

$PE = \frac{1}{2} - \frac{1}{4}\sum_{i \in I} |p(i) - q(i)|$,

i.e.

$PE = \frac{1}{2}\left(1 - d_L(p, q)\right)$. (32)

Obviously, when $p = q$, and consequently $d_L(p, q) = 0$, we have $PE = 1/2$, expressing
the fact that it is impossible to distinguish between two processes whose experimental
outcomes are identical. The other extreme is achieved when $PE = 0$: this follows
when $d_L(p, q) = 1$, which, as remarked before, happens if and only if $p$ and $q$ have
disjoint supports. In other words, when every possible result of measurement can
be produced by only one of the two processes under consideration, there can be no
mistake in determining which of those processes was operating.

From the above we see that our definition of the $L_1$-norm distance between
probability distributions admits a natural probabilistic interpretation: the $L_1$-norm
distance is closely related to the level of reliability of the rule of greater likelihood,
applied to determining which of two a priori equally likely processes is being observed
in an experiment.
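
The relation between the error probability and the distance can be checked on a small finite example. The sketch below is illustrative only: the function names and the two distributions are assumptions, and the error probability is computed outcome by outcome from the greater likelihood rule (with ties split evenly) rather than from Eq. (32), so that the two can be compared.

```python
def l1_distance(p, q):
    """d_L(p, q) = 1/2 * sum_i |p(i) - q(i)| for two distributions on one finite set."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))


def error_probability(p, q):
    """Error probability of the greater likelihood rule with equal priors.

    Each outcome i occurs with probability (p(i) + q(i)) / 2; the rule is wrong
    with the conditional probability of the less likely process (1/2 on ties).
    """
    total = 0.0
    for pi, qi in zip(p, q):
        weight = 0.5 * (pi + qi)
        if weight == 0.0:
            continue                      # outcome never occurs
        if pi > qi:
            wrong = qi / (pi + qi)
        elif qi > pi:
            wrong = pi / (pi + qi)
        else:
            wrong = 0.5
        total += weight * wrong
    return total


# Hypothetical distributions on a four-element outcome space.
p = [0.5, 0.3, 0.2, 0.0]
q = [0.1, 0.3, 0.2, 0.4]

print(l1_distance(p, q))                  # distance d_L
print(error_probability(p, q))            # direct error probability
print(0.5 * (1.0 - l1_distance(p, q)))    # right-hand side of Eq. (32)
```

The last two printed numbers agree, including the contribution of the tied outcomes, which is the content of Eq. (32).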

3.2. Continuous probabilistic space 

The above discussion, beginning with the definition of the $L_1$-norm distance, may be
generalized to the case when $I$ is a continuous probabilistic space, with the obvious
substitutions of sums by integrals etc. To perform this generalization rigorously, it
is simplest to assume the existence of a "reference" measure $\mu$ on $I$ such that both
probabilistic measures involved in the definition of $d_L$ are absolutely continuous with
respect to $\mu$, and may therefore be represented by non-negative functions $p(i)$, $q(i)$.
The distance $d_L(p, q)$ is then given by

$d_L(p, q) = \frac{1}{2}\int |p(i) - q(i)|\,d\mu(i)$.

It then remains to be shown that the result is independent of the choice of $\mu$. This
easily follows by observing that any other $\mu_1$ fulfilling the required properties must be
absolutely continuous with respect to $\mu$, and vice versa, when restricted to the union of
the supports of $p$ and $q$:

$\mu_1 = \rho\,\mu$ on $\{i \in \mathrm{supp}(p) \cup \mathrm{supp}(q)\}$,

with $\rho$ a strictly positive function (not to be confused with the signal-to-noise ratio), and

$p_1(i) = p(i)/\rho(i)$,

and likewise for $q$. Clearly, $d_L$ is independent of $\mu$.

In fact, it follows from classical work by Riesz on measure theory that it is not
even necessary to assume the existence of a "reference" measure $\mu$, as our definition
of $d_L$ is simply a special case of Riesz's definition of the norm on the space of additive
functions of sets.

Using the fact that a probability density function is non-negative we can write the
distance $d_L$ as

$d_L = \frac{1}{2}\int \left|\frac{p}{q} - 1\right| q\,dx = \frac{1}{2} E_q\left[|\Lambda - 1|\right]$, (33)

where $\Lambda = \frac{p}{q}$ is the likelihood ratio.
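
Writing the distance as an expectation over $q$ suggests a simple Monte Carlo estimate: draw samples from $q$ and average $|\Lambda - 1|$. The sketch below is an illustration under assumptions made here (two unit-variance Gaussians, a fixed random seed, hypothetical function names); the reference value uses the standard closed form $\mathrm{erf}(|\mu - \nu|/(2\sqrt{2}\sigma))$ for equal-variance Gaussians.

```python
import random
from math import erf, exp, sqrt


def likelihood_ratio(x, mu, nu, sigma):
    """Lambda = p/q for p = N(mu, sigma^2) and q = N(nu, sigma^2)."""
    return exp(((x - nu) ** 2 - (x - mu) ** 2) / (2.0 * sigma ** 2))


def l1_distance_monte_carlo(mu, nu, sigma, n_samples, seed=0):
    """Estimate d_L = 1/2 * E_q[|Lambda - 1|] by averaging over samples drawn from q."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(nu, sigma)          # sample from q
        total += abs(likelihood_ratio(x, mu, nu, sigma) - 1.0)
    return 0.5 * total / n_samples


estimate = l1_distance_monte_carlo(1.0, 0.0, 1.0, 200_000)
exact = erf(1.0 / (2.0 * sqrt(2.0)))      # closed form for |mu - nu| = sigma = 1
print(estimate, exact)
```

The estimate converges slowly when the likelihood ratio has heavy tails under $q$, so a closed form or deterministic quadrature is preferable whenever one is available.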



3.3. Non-uniform priors 



It is not difficult to generalize the discussion to the case when the two
processes, described by probability distributions $p(i)$ and $q(i)$, are no longer treated
symmetrically. When the assumption of equal a priori probabilities is relaxed, the
total probability of outcome $i$ is now given by

$P p(i) + Q q(i)$

with $P + Q = 1$. Let us now think in terms of searching for some "signal", whose
presence leads to measurements distributed according to $p(i)$, versus the "pure noise"
described by $q(i)$. The signal is more likely than not to be present for outcomes that
fall within the set

$\mathcal{P} = \{i \in I : P p(i) > Q q(i)\}$.

As long as we can assume that $q(i)$ is nonzero for all $i \in I$, this inequality may be
written in terms of the likelihood ratio $p/q$:

$\frac{p(i)}{q(i)} > k$,

where the "detection threshold" $k$ is given in terms of the a priori probabilities $P$ and
$Q$:

$k = \frac{Q}{P} = \frac{1 - P}{P}$,

and conversely

$P = \frac{1}{k + 1}$, $Q = \frac{k}{k + 1}$.

It is usual in signal detection theory to consider separately the "false alarm
probability" (the probability that a decision rule mistakenly leads us to believe the
signal to be present), and the "false dismissal probability" (that the signal might be
missed when it is in fact present). Instead, we restrict ourselves here to considering,
as before, the total probability of an erroneous decision $PE$. In the current model,
$PE$ is given by:

$PE = \sum_{i \in I} \min(P p(i), Q q(i)) = \frac{1}{2}\left(1 - \sum_{i \in I} |P p(i) - Q q(i)|\right)$.

While this can no longer be expressed in terms of $d_L(p, q)$, contrary to the case of
equal a priori probabilities, the expression above still involves the $L_1$-norm of the
difference between the two (non-normalized) densities (measures) $Pp$ and $Qq$.

Let us now consider two different signals, described by distributions $p_1$ and $p_2$,
which are close to each other in the sense of the $L_1$ distance (i.e. $d_L(p_1, p_2) < \epsilon$ for
some small $\epsilon$), and a pure noise described by $q$. A simple application of the triangle
inequality for the $L_1$-norm leads to the inequality

$\sum_{i \in I} |P p_2(i) - Q q(i)| \leq 2P\epsilon + \sum_{i \in I} |P p_1(i) - Q q(i)|$.

Note that in the case of equal a priori probabilities the above simply reduces to the
triangle inequality:

$d_L(p_2, q) \leq \epsilon + d_L(p_1, q)$. (34)

In other words: when two signals (or rather, their corresponding data distributions)
differ by no more than $\epsilon$ in terms of the $L_1$ distance, the probabilities of confusing
the presence of the signal with pure noise (under equal detection threshold) differ by a
term which is also of order $\epsilon$. This property makes the distance $d_L$ appropriate
for the definition of the covering radius in the construction of the grid of templates. When
the probability distribution $p_1$ corresponds to the true signal and $p_2$ to the signal with
parameters at the node of the grid, the inequality (34) tells us that the error probability
will not increase by more than $\epsilon$. Likewise, in the problem of designing suboptimal
templates that approximate the true signal, the inequality (34) says that when the distance
$d_L$ of the signal to the template is less than $\epsilon$, the probability of error does not increase
by more than $\epsilon$.
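
A small numerical check makes the bound concrete. The sketch below is hypothetical throughout (the distributions, the prior and the function names are chosen only for illustration). It evaluates the error probability for unequal priors both as a sum of minima and through the weighted $L_1$ norm, and then verifies that replacing $p_1$ by a nearby $p_2$ changes the error probability by at most $P\,d_L(p_1, p_2)$, which follows from the triangle inequality for the weighted sums.

```python
def l1_distance(p, q):
    """d_L(p, q) = 1/2 * sum_i |p(i) - q(i)|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def weighted_l1(p, q, prior_p):
    """sum_i |P p(i) - Q q(i)| with Q = 1 - P."""
    prior_q = 1.0 - prior_p
    return sum(abs(prior_p * a - prior_q * b) for a, b in zip(p, q))


def error_probability(p, q, prior_p):
    """PE = sum_i min(P p(i), Q q(i)) with Q = 1 - P."""
    prior_q = 1.0 - prior_p
    return sum(min(prior_p * a, prior_q * b) for a, b in zip(p, q))


q = [0.4, 0.3, 0.2, 0.1]          # "pure noise" (assumed)
p1 = [0.1, 0.2, 0.3, 0.4]         # "true signal" (assumed)
p2 = [0.12, 0.18, 0.32, 0.38]     # nearby "template" (assumed)
prior_p = 0.3                     # a priori probability P of the signal (assumed)

eps = l1_distance(p1, p2)
pe1 = error_probability(p1, q, prior_p)
pe2 = error_probability(p2, q, prior_p)
print(eps, pe1, pe2, abs(pe2 - pe1), prior_p * eps)
```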



4. Kullback-Leibler divergence 

A useful measure of distance between the two probability measures $p(x)$ and $q(x)$ was
defined by Kullback and Leibler [10]. The Kullback-Leibler divergence $d_{KL}(p, q)$ is
defined as

$d_{KL}(p, q) = E_p\left[\log\frac{p}{q}\right] + E_q\left[\log\frac{q}{p}\right]$. (35)

Using the likelihood ratio $\Lambda = \frac{p}{q}$ we can write the Kullback-Leibler divergence as

$d_{KL} = E_p[\log\Lambda] - E_q[\log\Lambda]$. (36)

In Bayesian statistics the KL divergence can be used as a measure of the "distance" 
between the prior distribution and the posterior distribution. In coding theory, the 
KL divergence can be interpreted as the needed extra message-length per datum for 
sending messages distributed as q, if the messages are encoded using a code that is 
optimal for distribution p. 



5. Examples 

In this Section we shall calculate the distance $d_L$ and the Kullback-Leibler divergence
$d_{KL}$ in several cases useful for applications.

5.1. $d_L$ distance

5.1.1. Gaussian probability density function. Let $p$ and $q$ be Gaussian probability
density functions with means $\mu$ and $\nu$ respectively and the same variance $\sigma^2$. Then the
distance $d_L$ is given by

$d_L = \mathrm{erf}\left(\frac{|\mu - \nu|}{2\sqrt{2}\,\sigma}\right)$, (37)

where erf is the error function defined by Eq. (9).

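For the case of two arbitrary Gaussian probability density functions $p$ and $q$ with
means $\mu_p$, $\mu_q$ and variances $\sigma_p^2$, $\sigma_q^2$ respectively we have

$d_L = \frac{1}{2}\left|\,\mathrm{erf}\left[\frac{x_1 - \mu_p}{\sqrt{2}\,\sigma_p}\right] - \mathrm{erf}\left[\frac{x_1 - \mu_q}{\sqrt{2}\,\sigma_q}\right] - \left(\mathrm{erf}\left[\frac{x_2 - \mu_p}{\sqrt{2}\,\sigma_p}\right] - \mathrm{erf}\left[\frac{x_2 - \mu_q}{\sqrt{2}\,\sigma_q}\right]\right)\right|$, (38)

where

$x_{1,2} = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$, (39)

$a = \frac{1}{\sigma_p^2} - \frac{1}{\sigma_q^2}$, (40)

$b = -2\left(\frac{\mu_p}{\sigma_p^2} - \frac{\mu_q}{\sigma_q^2}\right)$, (41)

$c = \frac{\mu_p^2}{\sigma_p^2} - \frac{\mu_q^2}{\sigma_q^2} - 2\ln\frac{\sigma_q}{\sigma_p}$. (42)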
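
The closed forms for one-dimensional Gaussians can be compared with a direct numerical integration of $\frac{1}{2}\int |p - q|\,dx$. The sketch below is not taken from the paper: the function names, the integration range of twelve standard deviations and the midpoint rule are assumptions. For unequal variances it uses the two crossing points of the densities, the roots of $a x^2 + b x + c = 0$; for equal variances $a = 0$ and the single-crossing formula applies instead.

```python
from math import erf, exp, log, pi, sqrt


def gaussian_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))


def l1_distance_gaussians(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form d_L between N(mu_p, sigma_p^2) and N(mu_q, sigma_q^2)."""
    if sigma_p == sigma_q:
        # Equal variances: a = 0, the densities cross once.
        return erf(abs(mu_p - mu_q) / (2.0 * sqrt(2.0) * sigma_p))
    # Unequal variances: the densities cross at the two roots x1, x2.
    a = 1.0 / sigma_p ** 2 - 1.0 / sigma_q ** 2
    b = -2.0 * (mu_p / sigma_p ** 2 - mu_q / sigma_q ** 2)
    c = (mu_p ** 2 / sigma_p ** 2 - mu_q ** 2 / sigma_q ** 2
         - 2.0 * log(sigma_q / sigma_p))
    disc = sqrt(b * b - 4.0 * a * c)
    x1 = (-b + disc) / (2.0 * a)
    x2 = (-b - disc) / (2.0 * a)

    def term(x, mu, sigma):
        return erf((x - mu) / (sqrt(2.0) * sigma))

    return 0.5 * abs(term(x1, mu_p, sigma_p) - term(x1, mu_q, sigma_q)
                     - (term(x2, mu_p, sigma_p) - term(x2, mu_q, sigma_q)))


def l1_distance_numeric(mu_p, sigma_p, mu_q, sigma_q, n=20_000):
    """Midpoint-rule estimate of 1/2 * integral |p - q| dx over a wide interval."""
    lo = min(mu_p - 12.0 * sigma_p, mu_q - 12.0 * sigma_q)
    hi = max(mu_p + 12.0 * sigma_p, mu_q + 12.0 * sigma_q)
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        total += abs(gaussian_pdf(x, mu_p, sigma_p) - gaussian_pdf(x, mu_q, sigma_q))
    return 0.5 * h * total


print(l1_distance_gaussians(0.0, 1.0, 1.0, 2.0), l1_distance_numeric(0.0, 1.0, 1.0, 2.0))
print(l1_distance_gaussians(0.0, 1.0, 1.0, 1.0), l1_distance_numeric(0.0, 1.0, 1.0, 1.0))
```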



Let $p(x)$ and $q(x)$ be two $n$-dimensional multivariate Gaussian probability
density functions with vector means $\mu$ and $\nu$ respectively and the same covariance
matrix $\Sigma$. Thus $p(x)$ is given by

$p(x) = \frac{1}{(2\pi)^{n/2}\sqrt{\det\Sigma}} \exp\left[-\frac{1}{2}(x - \mu)'\,\Sigma^{-1}(x - \mu)\right]$, (43)

where ' denotes transpose, and similarly for $q(x)$. To calculate the distance we first
perform a change of variables so that the covariance matrix $\Sigma$ is diagonal and the
diagonal elements are equal. We can then rotate the vector $\mu - \nu$ so that it is aligned along
the $x_1$ axis. These transformations reduce the calculation of the distance $d_L(p, q)$ to the
one-dimensional case. Consequently we have

$d_L = \mathrm{erf}\left[\frac{1}{2\sqrt{2}}\sqrt{(\mu - \nu)'\,\Sigma^{-1}(\mu - \nu)}\right]$. (44)

5.1.2. Stationary Gaussian process. Let us consider the case of signals $s_1$ and
$s_2$ added to a stationary Gaussian random process. We then have two Gaussian
probability densities $p_1$ and $p_2$ when respectively the signal $s_1$ or the signal $s_2$ is present.
We can obtain the distance $d_L(p_1, p_2)$ as the limit of the case of the multivariate
Gaussian distributions by replacing $(\mu - \nu)'\,\Sigma^{-1}(\mu - \nu)$ with $(s_1 - s_2|s_1 - s_2)$, where
the scalar product $(\,\cdot\,|\,\cdot\,)$ is defined by Eq. (3). Thus in this case we have

$d_L(s_1, s_2) = \mathrm{erf}\left[\frac{1}{2\sqrt{2}}\sqrt{(s_1 - s_2|s_1 - s_2)}\right]$. (45)

We have introduced the notation

$d_L(s_1, s_2) = d_L(p_1, p_2)$, (46)

where $p_1$, $p_2$ are the probability density functions when the signals $s_1$, $s_2$ respectively are
present in the data. In the case of detection of a signal in noise, when we have two
Gaussian probability density functions, $p_1$ when the signal $s$ is present and $p_0$ when
the signal is absent, the distance $d_L(p_0, p_1)$ is immediately obtained from Eq. (45)
above:

$d_L = \mathrm{erf}\left[\frac{1}{2\sqrt{2}}\sqrt{(s|s)}\right]$. (47)

Finally suppose that $p$ and $q$ belong to the same family $p_\theta(x)$ of probability density
functions parameterized by the parameter $\theta$. Let $p(x) = p_\theta(x)$ and $q(x) = p_{\theta+\delta\theta}(x)$, where
$\delta\theta$ is small. Then using a Taylor expansion to the first order we have

$d_L(\theta, \theta + \delta\theta) \simeq \frac{1}{2}\int \left|\frac{\partial p_\theta}{\partial\theta_i}\,\delta\theta_i\right| dx$, (48)

where summation over the repeated index $i$ is implied and we have introduced the shorthand notation

$d_L(\theta, \theta') = d_L(p_\theta, p_{\theta'})$. (49)

5.2. $d_{KL}$ divergence

5.2.1. Gaussian probability density function. For the case of two Gaussian probability
density functions with means $\mu_p$, $\mu_q$ and variances $\sigma_p^2$, $\sigma_q^2$ respectively we have

$d_{KL} = \frac{(\sigma_p^2 - \sigma_q^2)^2 + (\sigma_p^2 + \sigma_q^2)(\mu_p - \mu_q)^2}{2\,\sigma_p^2\,\sigma_q^2}$. (50)

When the two Gaussian probability distributions have the same variance equal to $\sigma^2$
the above formula reduces to

$d_{KL} = \frac{(\mu_p - \mu_q)^2}{\sigma^2}$. (51)

For the case of $n$-dimensional multivariate Gaussian probability density functions $p$
and $q$ with means $\mu$, $\nu$ and the same covariance matrix $\Sigma$ the divergence $d_{KL}(p, q)$ is given by

$d_{KL}(p, q) = (\mu - \nu)'\,\Sigma^{-1}(\mu - \nu)$. (52)
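
As a plausibility check, the closed form for two one-dimensional Gaussians can be compared with a numerical evaluation of $E_p[\log\Lambda] - E_q[\log\Lambda] = \int (p - q)\log(p/q)\,dx$. The sketch below is an illustration with assumed function names, integration range and grid; the logarithm of the likelihood ratio is written out analytically so that no logarithm of a vanishing density is ever taken.

```python
from math import exp, log, pi, sqrt


def gaussian_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2.0 * pi))


def kl_closed_form(mu_p, sigma_p, mu_q, sigma_q):
    """Symmetrized divergence of Eq. (50) for two one-dimensional Gaussians."""
    vp, vq = sigma_p ** 2, sigma_q ** 2
    return ((vp - vq) ** 2 + (vp + vq) * (mu_p - mu_q) ** 2) / (2.0 * vp * vq)


def kl_numeric(mu_p, sigma_p, mu_q, sigma_q, n=20_000):
    """Midpoint-rule estimate of integral (p - q) * log(p/q) dx."""
    lo = min(mu_p - 12.0 * sigma_p, mu_q - 12.0 * sigma_q)
    hi = max(mu_p + 12.0 * sigma_p, mu_q + 12.0 * sigma_q)
    h = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * h
        log_ratio = (log(sigma_q / sigma_p)
                     - 0.5 * ((x - mu_p) / sigma_p) ** 2
                     + 0.5 * ((x - mu_q) / sigma_q) ** 2)
        total += (gaussian_pdf(x, mu_p, sigma_p) - gaussian_pdf(x, mu_q, sigma_q)) * log_ratio
    return h * total


print(kl_closed_form(0.0, 1.0, 1.0, 2.0), kl_numeric(0.0, 1.0, 1.0, 2.0))
print(kl_closed_form(0.0, 1.5, 2.0, 1.5), kl_numeric(0.0, 1.5, 2.0, 1.5))
```

In the equal-variance case the closed form reduces to $(\mu_p - \mu_q)^2/\sigma^2$, as in Eq. (51).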



5.2.2. Stationary Gaussian process. In the case of detection of a signal in Gaussian
stationary noise we immediately obtain the Kullback-Leibler divergence $d_{KL}$ between
the probability density functions $p_0$ and $p_1$, when respectively the signal is absent and
present, using the Cameron-Martin formula (2). In this case we have

$d_{KL} = (s|s)$. (53)

Thus in this case the Kullback-Leibler divergence is precisely equal to the square of the
signal-to-noise ratio. Suppose that $p$ and $q$ belong to the same family $p_\theta(x)$ of probability
density functions parameterized by the parameter $\theta$. Let $p(x) = p_\theta(x)$ and $q(x) = p_{\theta+\delta\theta}(x)$,
where $\delta\theta$ is small. Then one can show by Taylor expansion that to the lowest non-vanishing order the
KL-divergence $d_{KL}(\theta, \theta + \delta\theta)$ between $p_\theta$ and $p_{\theta+\delta\theta}$ is given by

$ds^2 = d_{KL}(\theta, \theta + \delta\theta) = \Gamma_{ij}\,\delta\theta_i\,\delta\theta_j$. (54)

From Eqs. (53) and (54) we see that the Kullback-Leibler divergence is directly
related to the basic quantities used in detecting signals in noise and estimating their
parameters: the signal-to-noise ratio and the Fisher information matrix. Also,
Eq. (54) reinforces the interpretation of the Fisher information matrix as a
Riemannian metric on the parameter space [4], as the square root of the Kullback-
Leibler divergence of probability density functions of closely spaced parameters is the
line element $ds$ for the Fisher metric $\Gamma$.

6. Comparing the two distances

In contrast to the distance $d_L$, the Kullback-Leibler divergence is not a metric because
it does not fulfill the triangle inequality. The distance $d_L$ has the advantage over the
Kullback-Leibler divergence that it exists even if the two probability measures are not
absolutely continuous with respect to each other. If the probability measure $p$ is not
absolutely continuous with respect to $q$, the divergence $d_{KL}$ does not exist.

Let us compare the $L_1$-norm distance with the Kullback-Leibler divergence and
also with the line element $ds$ defined by the Fisher matrix (Eq. (54)) for the case of a
monochromatic signal. Let us calculate the norm $N = (s_1 - s_2|s_1 - s_2) = (s_1|s_1) +
(s_2|s_2) - 2(s_1|s_2)$, where $s_1$ and $s_2$ are two monochromatic signals with amplitudes $A_1$
and $A_2$, phases $\phi_1$ and $\phi_2$, and angular frequencies $\omega_1$ and $\omega_2$ respectively. Assuming
that over the bandwidth $[\omega_1, \omega_2]$ the spectral density is constant and equal to $S_0$, using
Parseval's theorem, and assuming that the observation time $T$ is much longer than
the period of the signals, we have

$N = \rho_1^2 + \rho_2^2 - 2\rho_1\rho_2\left[\langle\cos(\Delta\omega t)\rangle\cos(\Delta\phi) - \langle\sin(\Delta\omega t)\rangle\sin(\Delta\phi)\right]$, (55)
where $\rho_1$, $\rho_2$ are the signal-to-noise ratios for signals $s_1$ and $s_2$ respectively, $\Delta\omega = \omega_1 - \omega_2$,
and $\Delta\phi = \phi_1 - \phi_2$. The operator $\langle\,\cdot\,\rangle$ is defined by Eq. (20). The Kullback-Leibler
divergence $d_{KL}(s_1, s_2)$ between the Gaussian probability density functions $p_1$ and $p_2$
for signals $s_1$ and $s_2$ and the distance $d_L(s_1, s_2)$ are given by (see Eq. (45)):

$d_{KL} = N$, (56)

$d_L = \mathrm{erf}\left(\frac{1}{2\sqrt{2}}\sqrt{N}\right)$. (57)

Let us assume that the two signals $s_1$ and $s_2$ have the same amplitudes and phases.
Then we have $\rho_1 = \rho_2 = \rho$ and the divergence $d_{KL}$ and the distance $d_L$ are given by

$d_{KL} = 2\rho^2\left(1 - \frac{\sin(\Delta\omega T)}{\Delta\omega T}\right)$, (58)

$d_L = \mathrm{erf}\left(\frac{1}{2\sqrt{2}}\sqrt{d_{KL}}\right)$. (59)

The line element $ds^2$ defined by the Fisher matrix is given by the first non-vanishing
term of the Taylor expansion of $d_{KL}$ in $\Delta\omega$:

$ds = \sqrt{\Gamma_{\omega_0\omega_0}(\Delta\omega)^2} = \frac{1}{\sqrt{3}}\,\rho\,T\,|\Delta\omega|$, (60)

where $\Gamma_{\omega_0\omega_0}$ is the component of the Fisher matrix given by Eq. (30). It is clear
from Eqs. (37) and (51) that it is appropriate to compare the $L_1$-norm distance with
the square root of the Kullback-Leibler divergence. In Figure 1 we have plotted $ds$, $\sqrt{d_{KL}}$,
and $d_L$ as functions of the frequency difference expressed in Fourier bins. A Fourier bin
is equal to $1/T$. We see from Figure 1 that for frequency differences larger than a
quarter of a Fourier bin the distance $ds$ based on the Fisher matrix begins to deviate
substantially from the Kullback-Leibler divergence. This shows the limitations of the
applicability of the Fisher matrix. This is a consequence of the fact that the Fisher
matrix is obtained by a Taylor expansion up to the second order terms of the Kullback-
Leibler divergence. As we shall see in Section 7.2 below, we can apply the distance
measures to the construction of a grid of templates by considering $s_1$ as a signal and
$s_2$ as a template. The Taylor expansion may be a good approximation for a very fine
grid. However, for computationally intensive searches like a search for periodic sources
[T21 IT5], where one needs to choose a loose grid, the Taylor expansion of the KL-divergence
up to second order terms is not accurate.

Figure 1. Comparison of the $L_1$-norm distance, the Kullback-Leibler divergence,
and the line element defined by the Fisher matrix for the case of two
monochromatic signals with the same amplitudes and phases but different
frequencies as functions of frequency difference expressed in Fourier bins.
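
The three quantities compared in Figure 1 are easy to tabulate. The sketch below is a hedged illustration rather than a reproduction of the figure: it assumes a signal-to-noise ratio of 3, takes the line element as the leading Taylor term of Eq. (58), $ds^2 = \rho^2 (\Delta\omega T)^2/3$, and assumes that a frequency difference of $b$ Fourier bins corresponds to $\Delta\omega T = 2\pi b$. All function names are hypothetical.

```python
from math import erf, pi, sin, sqrt


def kl_divergence(rho, x):
    """Eq. (58): d_KL = 2 rho^2 (1 - sin(x)/x), with x = Delta omega * T."""
    if x == 0.0:
        return 0.0
    return 2.0 * rho ** 2 * (1.0 - sin(x) / x)


def l1_distance(rho, x):
    """Eq. (59): d_L = erf(sqrt(d_KL) / (2 sqrt(2)))."""
    return erf(sqrt(kl_divergence(rho, x)) / (2.0 * sqrt(2.0)))


def fisher_line_element(rho, x):
    """Leading Taylor term of Eq. (58): ds = rho |x| / sqrt(3)."""
    return rho * abs(x) / sqrt(3.0)


rho = 3.0                                   # assumed signal-to-noise ratio
for bins in (0.05, 0.1, 0.25, 0.5, 1.0):
    x = 2.0 * pi * bins                     # assumed conversion from Fourier bins
    print(bins, fisher_line_element(rho, x), sqrt(kl_divergence(rho, x)), l1_distance(rho, x))
```

For small frequency differences the first two columns agree closely, while at one bin the Fisher line element overestimates $\sqrt{d_{KL}}$ by more than a factor of two.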



7. Applications 

Using the example of the monochromatic signal let us consider several applications of
the distance $d_L$ to the problem of detection of signals in noise and estimation of parameters.



7.1. Signal resolution 

The distance $d_L(s_1, s_2)$ determines how well we can resolve two signals $s_1$ and $s_2$. The
larger the distance, the better the signal resolution. It is useful to have a reduced form
of the distance that depends only on the frequencies $\omega_1$ and $\omega_2$. We can achieve a
reduction of the phase parameter by considering the worst case, i.e. the minimum of
the distance $d_L$ given by Eqs. (55) and (57) over the phase difference $\Delta\phi$. One easily
finds an analytic formula for this minimum:

$d_{L\min} = \mathrm{erf}\left[\frac{1}{2\sqrt{2}}\sqrt{\rho_1^2 + \rho_2^2 - 2\rho_1\rho_2\sqrt{\frac{2(1 - \cos(\Delta\omega T))}{(\Delta\omega T)^2}}}\,\right]$. (61)

Figure 2. $L_1$-norm distance between Gaussian probability density functions for
two monochromatic signals as a function of their frequency difference. We assume
that the signal-to-noise ratios of both signals are equal to 3.

There is no nontrivial minimum with respect to amplitudes (the minimum of $d_{L\min}$ 
with respect to the amplitudes $A_1$ and $A_2$ is zero). In Figure 2 we plot the distance 
$d_{L\min}$ given by Eq. (61) as a function of the difference in angular frequencies 
$\Delta\omega$ expressed in Fourier bins. We assume that the signal-to-noise ratios of the 
signals $s_1$ and $s_2$ are both equal to 3. The distance increases as the difference 
between the frequencies increases, and there is a substantial increase in the distance 
as the frequency difference between the two signals approaches one bin. 
This characteristic increase over one bin is independent of the signal-to-noise ratios 
of the signals. This justifies a folk theorem that two monochromatic signals can be 
resolved when their frequencies differ by one bin. In Figure 3 we plot the 
distance as a function of the signal-to-noise ratios of the two signals for a fixed 
frequency difference of one bin. The distance $d_L$ increases as the signal-to-noise 
ratio increases and also as the difference between the signal-to-noise ratios of the 
two signals increases. Thus we can achieve an arbitrarily large distance, and 
consequently arbitrarily large resolvability of the signal, if its signal-to-noise 
ratio is sufficiently large. 

Figure 3. $L_1$-norm distance as a function of the signal-to-noise ratios $\rho_1$ and 
$\rho_2$ of two signals for a fixed difference between the frequencies of the signals 
equal to one bin. 
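
As an illustration of Eq. (61), the following minimal sketch evaluates the 
phase-minimized distance for two signals with equal signal-to-noise ratios. It 
assumes that a frequency difference of one Fourier bin corresponds to 
$\Delta\omega T = 2\pi$; the function names `overlap` and `d_l_min` are hypothetical 
and introduced only for this example. 

```python
import math


def overlap(bins):
    """Factor sqrt(2(1 - cos x)) / |x| from Eq. (61), with x = 2*pi*bins.

    Assumption: one Fourier bin corresponds to Delta-omega * T = 2*pi.
    """
    x = 2.0 * math.pi * bins
    if x == 0.0:
        return 1.0  # limiting value for x -> 0
    return math.sqrt(max(0.0, 2.0 * (1.0 - math.cos(x)))) / abs(x)


def d_l_min(rho1, rho2, bins):
    """Distance of Eq. (61), minimized over the phase difference."""
    arg = rho1 ** 2 + rho2 ** 2 - 2.0 * rho1 * rho2 * overlap(bins)
    # max() guards against tiny negative values caused by rounding
    return math.erf(math.sqrt(max(0.0, arg)) / (2.0 * math.sqrt(2.0)))


if __name__ == "__main__":
    for bins in (0.0, 0.25, 0.5, 0.75, 1.0, 1.5):
        print(f"{bins:4.2f} bins: d_Lmin = {d_l_min(3.0, 3.0, bins):.4f}")
```

Under the stated assumption the overlap factor vanishes at exactly one bin, so for 
$\rho_1 = \rho_2 = 3$ the distance there equals $\mathrm{erf}(1.5)$, whereas it is zero 
for coincident frequencies. 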



7.2. Grid of templates 

We can use the distance $d_L$ in the construction of the grid of templates. We construct 
the grid in such a way that the maximum distance between a signal and a filter 
is less than a certain specified value. Let us first calculate the distance 
$d_L(s, s_F)$, where $s$ is the signal with parameters $A_0, \phi_0, \omega_0$ and $s_F$ 
is the template. The template $s_F$ is just a monochromatic signal with some test 
parameters $A_F, \phi_F, \omega_F$. From Section 2 we know that the optimal statistic 
$\mathcal{F}$ is invariant with respect to transformations of the amplitude and the 
phase of the filter, so we can set the amplitude $A_F$ of the filter such that its 
signal-to-noise ratio is equal to 1, and we can choose the phase $\phi_F$ of the 
filter to be equal to 0. With these simplifications the distance $d_L$ is given by 

$$ d_L = \mathrm{erf}\left[ \frac{1}{2\sqrt{2}} \sqrt{ \rho^2 + 1 - 2\rho \left[ \langle\cos(\Delta\omega t)\rangle \cos(\phi_0) - \langle\sin(\Delta\omega t)\rangle \sin(\phi_0) \right] } \right]. \tag{62} $$

Since the $\mathcal{F}$-statistic depends only on the frequency, we only need the grid 
in the frequency space, and consequently we need to obtain a reduced distance that 
depends only on the frequencies of the signal and the filter. Again, as in the case of 
signal resolution, it is natural to consider the worst-case scenario. In this case it 
corresponds to the maximum of the distance $d_L$ with respect to the phase $\phi_0$ of 
the signal. We easily get 

$$ d_{L\max} = \mathrm{erf}\left[ \frac{1}{2\sqrt{2}} \sqrt{ \rho^2 + 1 + 2\rho \sqrt{\frac{2(1 - \cos(\Delta\omega T))}{(\Delta\omega T)^2}} } \right]. \tag{63} $$

Figure 4. $L_1$-norm distance maximized over phase as a function of frequency 
spacing for a monochromatic signal. The signal-to-noise ratio is set to 8. 



The function $d_{L\max}$ for signal-to-noise ratio $\rho > 0$ is a monotonically 
increasing function of $\rho$. We choose the grid using the distance $d_L$ to determine 
the covering radius of the grid. To calculate the covering radius we can set the 
signal-to-noise ratio $\rho$ in Eq. (63) equal to the threshold value of $\rho$ used in 
the search. In Figure 4 we plot the distance $d_{L\max}$ as a function of the frequency 
difference $\Delta\omega$ between the signal and the filter for a signal-to-noise ratio 
equal to 8. 
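
The maximization over the signal phase that leads to Eq. (63) can be checked 
numerically. The sketch below assumes that the angle brackets denote the time 
average over the interval $[0, T]$, so that $\langle\cos(\Delta\omega t)\rangle = 
\sin(x)/x$ and $\langle\sin(\Delta\omega t)\rangle = (1 - \cos(x))/x$ with 
$x = \Delta\omega T$. It scans the phase-dependent distance over $\phi_0$ by brute 
force and compares the result with the closed form. All function names are 
hypothetical. 

```python
import math

SCALE = 2.0 * math.sqrt(2.0)


def averages(x):
    """Assumed time averages <cos(dw t)>, <sin(dw t)> over [0, T], x = dw*T."""
    if x == 0.0:
        return 1.0, 0.0
    return math.sin(x) / x, (1.0 - math.cos(x)) / x


def d_l(rho, x, phi0):
    """Phase-dependent distance for a template with unit signal-to-noise ratio."""
    c, s = averages(x)
    arg = rho ** 2 + 1.0 - 2.0 * rho * (c * math.cos(phi0) - s * math.sin(phi0))
    return math.erf(math.sqrt(max(0.0, arg)) / SCALE)


def d_l_max(rho, x):
    """Closed form of Eq. (63)."""
    k = 1.0 if x == 0.0 else math.sqrt(max(0.0, 2.0 * (1.0 - math.cos(x)))) / abs(x)
    return math.erf(math.sqrt(rho ** 2 + 1.0 + 2.0 * rho * k) / SCALE)


def d_l_max_scan(rho, x, n=3600):
    """Brute-force maximum of d_l over the signal phase on a uniform grid."""
    return max(d_l(rho, x, 2.0 * math.pi * i / n) for i in range(n))


if __name__ == "__main__":
    for bins in (0.0, 0.25, 0.5, 1.0):
        x = 2.0 * math.pi * bins  # assumption: one bin means dw*T = 2*pi
        print(bins, d_l_max(8.0, x), d_l_max_scan(8.0, x))
```

The two columns printed for each frequency offset should agree to within the 
resolution of the phase scan. 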



7.3. Search templates 

Very often we do not have the exact model of the signal that we are searching for. Let 
us suppose that the signal that we expect to detect in noise is linearly modulated in 
frequency and has the following form 

$$ s = A_0 \cos(\omega_0 t + \omega_1 t^2 + \phi_0). \tag{64} $$

Let us also suppose that we know that $\omega_1$ is small and that we can expect to 
detect the signal (64) with a monochromatic signal template that has no frequency 
modulation. The distance $d_L$ is given by 

$$ d_L = \mathrm{erf}\left[ \frac{1}{2\sqrt{2}} \sqrt{ \rho^2 + 1 - 2\rho \left[ S\cos(\Delta\phi) - C\sin(\Delta\phi) \right] } \right], \tag{65} $$

where 

$$ S = \langle \sin(\Delta\omega t + \omega_1 t^2) \rangle, \tag{66} $$

$$ C = \langle \cos(\Delta\omega t + \omega_1 t^2) \rangle. \tag{67} $$

(68) 

To determine the quality of the search template we need to find the minimum of the 
distance with respect to the parameters of the filter. The minimum of $d_L$ with respect 
to the phase $\phi_F$ of the filter can easily be obtained and is given by 

$$ d_{L\min} = \mathrm{erf}\left[ \frac{1}{2\sqrt{2}} \sqrt{ \rho^2 + \rho_F^2 - 2\rho\rho_F \sqrt{S^2 + C^2} } \right]. \tag{69} $$

The minimum $d_{L\min}$ is independent of the phase $\phi_0$ of the signal and is a 
monotonically decreasing function of $\sqrt{S^2 + C^2}$. Consequently, finding the 
minimum of $d_L$ with respect to the frequency parameter $\omega_F$ of the filter is 
equivalent to finding the maximum of the function 

$$ FF = \sqrt{S^2 + C^2} \tag{70} $$

with respect to $\omega_F$. In Figure 5 we plot the distance (69) as a function of 
the frequency difference between the template and the signal expressed in frequency 
bins. For the case of a perfectly matched filter ($\omega_1 = 0$) the distance would 
have a minimum equal to 0 for $\Delta\omega = 0$. For a non-zero value of $\omega_1$ the 
minimum distance is larger than zero and it occurs for a certain frequency of the 
filter that is biased with respect to the true frequency of the signal. 
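
The bias described above can be reproduced with a short numerical sketch. It 
assumes that the angle brackets denote the time average over $[0, T]$, that 
$\Delta\omega = \omega_0 - \omega_F$, and it works with the dimensionless variables 
$x = \Delta\omega T$ and $y = \omega_1 T^2$. The averages $S$ and $C$ are approximated 
with a simple midpoint rule, and the maximum of $FF$ is located by a grid scan; the 
function names and the grid parameters are hypothetical choices for this example. 

```python
import math


def fitting_factor(x, y, n=400):
    """FF = sqrt(S^2 + C^2) of Eq. (70) for x = dw*T and y = w1*T^2.

    S and C are the assumed time averages over [0, T], approximated by the
    midpoint rule in the variable u = t/T on [0, 1].
    """
    h = 1.0 / n
    s = 0.0
    c = 0.0
    for i in range(n):
        u = (i + 0.5) * h
        phase = x * u + y * u * u
        s += math.sin(phase)
        c += math.cos(phase)
    return math.hypot(s * h, c * h)


def best_offset(y, x_max=6.0, step=0.02):
    """Grid scan over x; returns (largest FF, x at which it is attained)."""
    steps = int(round(2.0 * x_max / step))
    candidates = (-x_max + i * step for i in range(steps + 1))
    return max((fitting_factor(x, y), x) for x in candidates)


def d_l_min(rho, rho_f, ff):
    """Distance of Eq. (69), minimized over the filter phase."""
    arg = rho ** 2 + rho_f ** 2 - 2.0 * rho * rho_f * ff
    return math.erf(math.sqrt(max(0.0, arg)) / (2.0 * math.sqrt(2.0)))


if __name__ == "__main__":
    for y in (0.0, 2.0):
        ff, x = best_offset(y)
        print(f"w1*T^2 = {y}: max FF = {ff:.4f} at dw*T = {x:.2f}, "
              f"d_Lmin = {d_l_min(8.0, 8.0, ff):.4f}")
```

For $y = 0$ the scan returns $FF = 1$ at $x = 0$, so the distance vanishes for equal 
signal-to-noise ratios. For $y = 2$ the maximum of $FF$ is below 1 and is attained 
near $x = -y$, which under the assumed sign convention corresponds to a filter 
frequency close to $\omega_0 + \omega_1 T$, the mean instantaneous frequency of the 
signal. 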



8. Conclusion 



We have introduced the $L_1$-norm distance $d_L(p_1, p_2)$ between probability density 
functions $p_1$ and $p_2$. The $L_1$-norm provides a notion of distance between 
probability distributions endowed with a clear probabilistic interpretation, as 
discussed in Section 3. The $L_1$-norm distance can be a useful tool in gravitational 
wave data analysis for studying the problems of signal resolution, template placement, 
and the design of search templates. In a future paper we shall study these problems 
for realistic gravitational wave signals from supernovae, inspiralling binaries, 
rotating neutron stars and white dwarf binaries. 